Data Science Concepts and Analysis

Week 1: Welcome to PSTAT 100!

  • Course introduction

  • Course structure

  • Getting started

This week

  • Course introduction
    • Perspectives on data science
    • Scope and topics
  • Course structure
    • Format and schedule
    • Materials and resources
    • Assignments and assessment
    • Course policies

Course introduction

  • Perspective on data science: “lifecycle”

  • Course scope

What’s data science?

As currently understood, “data science” encompasses a wide range of activities that involve uncovering insights from quantitative information.

Data scientists typically combine specific interests (“domain knowledge”, e.g., biology) with computation, mathematics, and statistics and probability to contribute to knowledge in their communities.

  • Skills combine in different proportions – no singular background among practitioners.

  • Diverse communities – science, industry, government, medicine, academia, etc.

Data science lifecycle

There is an emerging consensus that doing data science involves proceeding through a lifecycle: a repeated sequence of steps.

  • Less consensus at the moment about how many steps and what they are (google ‘data science lifecycle’ and check out all the flowcharts).

Most versions of the ‘data science lifecycle’ involve a few categories of steps:

  • Project planning

  • Data collection and organization

  • Exploration

  • Analysis

  • Communication and interpretation

(Perhaps it is this idea of a lifecycle that characterizes data science as distinct from other quantitative fields.)

Case studies: a preview

Case study 1: ACE and health

Association between adverse childhood experiences and general health, by sex.

Case study 1: ACE and health

You will:

  • process and recode 10K survey responses from the CDC’s 2019 Behavioral Risk Factor Surveillance System (BRFSS)
  • cross-tabulate health-related measurements with frequency of adverse childhood experiences

Case study 2: SEDA

Education achievement gaps as functions of socioeconomic indicators, by gender.

Case study 2: SEDA

You will:

  • merge test scores and socioeconomic indicators from the 2018 Stanford Education Data Archive by school district
  • visually assess correlations between gender achievement gaps among grade schoolers and socioeconomic indicators across school districts in CA

Case study 3: Paleoclimatology

Sea surface temperature reconstruction over the past 16,000 years.

Case study 3: Paleoclimatology

Clustering of diatom relative abundances in the Pleistocene (pre-11KyBP) vs. Holocene (post-11KyBP) epochs.

Case study 3: Paleoclimatology

You will:

  • explore ecological community structure from relative abundances of diatoms measured in ocean sediment core samples spanning ~15,000 years
  • use dimension reduction techniques to obtain measures of community structure
  • identify shifts associated with the transition from the Pleistocene to the Holocene epoch

Case study 4: Discrimination at DDS?

Apparent disparity in allocation of DDS benefits across racial groups.

Case study 4: Discrimination at DDS?

Expenditure is strongly associated with age.

Case study 4: Discrimination at DDS?

Correcting for age shows comparable expenditure across racial groups.

Case study 4: Discrimination at DDS?

You will:

  • assess the case for discrimination in allocation of DDS benefits
  • identify confounding factors present in the sample
  • model median expenditure by racial group after correcting for age

About the course

Scope

This course is about developing your data science toolkit with foundational skills:

  1. Core competency with R data science libraries
  2. Critical thinking about data
  3. Visualization and exploratory analysis
  4. Application of statistical concepts and methods in practice
  5. Communication and interpretation of results
  6. Ethical data science

What’s unique about PSTAT100?

There are a few distinctive aspects:

  • multiple end-to-end case studies
  • question-driven rather than method-driven
  • emphasis on project workflow
  • data storytelling and communication

Limitations

There are also some things we probably won’t cover:

  • Predictive modeling or machine learning (PSTAT 131)
  • Algorithm design and implementation (CS)
  • Techniques and methods for big data (PSTAT 135)
  • Theoretical basis for methods

Weekly Pattern

We’ll follow a simple weekly pattern:

  • Mondays
    • Lecture
    • Assignments due 11:59pm PST
  • Tuesdays
    • Section
  • Wednesdays
    • Lecture
    • Late work due 11:59pm PST

Pages

Course page             Primary use
Canvas                  Announcements and links to content
tinyurl.com/pstat100    Computing and distribution
pstat100.lsit.ucsb.edu  Computing

Tentative schedule

Week  Topic                 Subjects                                       Lifecycle
1     Introduction          What’s data science?
2     Tidy data             Import and organization                        Collect/Acquaint/Tidy
3     Sampling              Informative vs. uninformative data             Collect/Acquaint/Tidy
4     Visualization         Plot types, aesthetics, principles             Explore
5     Exploratory analysis  Density estimation and descriptive statistics  Explore/Analyze

Tentative schedule

Week  Topic                     Subjects                   Lifecycle
6     Exploratory analysis      Dimension reduction        Explore/Analyze
7     Regression and causality  Linear regression          Analyze/Interpret
8     Regression and causality  Non-linear models          Analyze/Interpret
9     Classification            Logistic regression, etc.  Analyze/Interpret
10    TBD                       TBD

The last week is flex time to explore other topics or extend coverage of previous topics.

Assessments

  • Labs (20% final grade weight, 10 pts each)
    • Short/moderate-length guided programming assignments
    • Given weekly through week 8
    • Collaboration encouraged; individual submissions required

Lab objective: introduce and develop core skills with data science libraries in R.

Assessments

  • Homeworks (50% final grade weight, 50 pts each)
    • Applications of course ideas and lab skills to analyses of real datasets
    • 4 assignments, released/due biweekly
    • Collaboration encouraged and group submissions allowed

Homework objective: practice workflow and explore case studies.

Assessments

  • Project (30% final grade weight)
    • Open-ended data analysis based on your interests
    • Final report due at the end of the quarter
    • Collaboration expected

Project objective: apply learned skills to a problem of your choosing.

Policies

  • Deadlines and late work
    • One-hour grace period on all deadlines
    • One free late on any assignment (except final project report)
    • 75% partial credit thereafter
    • No late work beyond 72 hours after deadline without instructor permission

Artificial intelligence

  • LLMs (ChatGPT etc) are allowed BUT…

  • The less you know the more likely you are to be convinced by misinformation

  • Ask yourself “Am I using it to avoid work? Or am I using it to help me develop an understanding?”

  • If you use it, you must cite it and state how it was used (on all assignments)

PSTAT 100 lifecycle

In this course, we’ll articulate the lifecycle in terms of the following steps.

  1. Hypothesize: question formulation/refinement.
  2. Collect: go out and sample or acquire data ‘second-hand’.
  3. Acquaint: get to know your dataset; make friends!
  4. Tidy: clean up and organize your data.
  5. Explore: search for patterns and structure.
  6. Analyze: seek to understand.
  7. Interpret: explain the meaning of your analysis.

This week

  • Attend your lab and meet your TA.

  • Confirm you can connect to RStudio and render a Quarto document.

PSTAT 100 lifecycle

No data science lifecycle would be complete without a flowchart!

Notice the multiple entry points – some projects start with a focused question; others, with a dataset.

Data Science - what language?

  • R: My preference for exploring, tidying, and visualizing data
  • Python: My preference for machine learning and advanced models
  • Can and should use both! This class: R

Reference: R for Data Science, https://r4ds.had.co.nz/

Illustrating the cycle

We’ll walk through a very simple example to get a concrete idea of how the cycle works.

Question: How do animals’ brains scale with their bodies?

You will see some code displayed as we walk through the example.

Don’t worry about understanding them – that’s what this course is for!

Focus on the process.

Step 0: Hypothesize

Question formulation or refinement

There are lots of datasets out there with brain and body weight measurements, so let’s make the question a bit more specific:

  • What is the relationship between an animal’s brain and body weight?

It might sound simple, but the relationship is thought to contain clues about evolutionary patterns pertaining to intelligence.

Step 1: Collect

Acquire a publicly available dataset comprising average body and brain weights for 62 mammals.

# import brain and body weights
library(tidyverse)
bb_weights <- read_csv('data/allison1976.csv') %>%
  select(1:3)
head(bb_weights, n=5)
# A tibble: 5 × 3
  species                body_wt brain_wt
  <chr>                    <dbl>    <dbl>
1 Africanelephant        6654      5712  
2 Africangiantpouchedrat    1         6.6
3 ArcticFox                 3.38     44.5
4 Arcticgroundsquirrel      0.92      5.7
5 Asianelephant          2547      4603  

Units of measurement

  • body weight in kilograms

  • brain weight in grams

Step 2: Acquaint

Especially because we didn’t collect this data ourselves, we should do a little background research to understand where the data came from (Allison et al. 1976) and what limitations might exist:

  • Information about mammals only \(\longrightarrow\) no information about birds, fish, reptiles, etc.

  • Species weren’t chosen to represent mammalia \(\longrightarrow\) probably shouldn’t seek to generalize

  • Averages measured \(\longrightarrow\) ‘aggregated’ data (not individual-level)

So we can only explore the question narrowly for this particular group of animals using the data at hand – we don’t stand to learn anything generalizable.

  • Not a bad thing! We can still see what the data suggest and use results for hypothesis generation.

Step 3: Tidy

Clean up and organize

This dataset is already impeccably neat: each row is an observation for one mammal, and the columns record the species and the two variables (average body and brain weight).

So no tidying needed – we’ll just check the dimensions and see if any values are missing.

# dimensions?
dim(bb_weights)
[1] 62  3
# missing values?
colSums(is.na(bb_weights))
 species  body_wt brain_wt 
       0        0        0 

Step 4: Explore

Look for patterns, structure, properties

Visualization is usually a good starting point.

# plot
ggplot(bb_weights, aes(x = body_wt, y = brain_wt)) +
  geom_point() + 
  theme_bw(base_size=16)

Step 4: Explore

Step 4: Explore

A simple transformation of the axes reveals a clearer pattern.

ggplot(bb_weights, aes(x = body_wt, y = brain_wt)) +
  geom_point() +
  theme_bw(base_size=16) +
  scale_x_log10(name = "body weight (kg)") +
  scale_y_log10(name = "brain weight (g)")

Step 4: Explore

Step 5: Analyze

The plot shows us that there’s a roughly linear relationship on the log scale:

\[\log(\text{brain}) = \alpha \log(\text{body}) + c\]

Step 6: Interpret

So what does that mean in terms of brain and body weights? A little algebra and we have:

\[(\text{brain}) \propto (\text{body})^\alpha\]
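In more detail: exponentiating both sides of the log-scale relation \(\log(\text{brain}) = \alpha \log(\text{body}) + c\) gives

\[\text{brain} = e^{\alpha\log(\text{body}) + c} = e^c\,(\text{body})^\alpha \propto (\text{body})^\alpha\]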

This is known as a “power-law” relationship: brain weight changes in proportion to a power of body weight.

So it appears that for these 62 mammals, the brain-body scaling is well-described by a power law. (Notice: no generalization/extrapolation!)

Step 0: Hypothesize

We can now engage in question refinement. Do other classes of animal exhibit the same power law relationship? Is it the same or different from animal to animal?

To investigate, we need richer data.

Step 1: Collect

A number of authors have compiled and published ‘meta-analysis’ datasets by combining the results of multiple studies.

Below we’ll import a few of these for three different animal classes.

# import metaanalysis datasets
reptiles <- read_csv('data/reptile_meta.csv')
birds <- read_csv('data/bird_meta.csv')
mammals <- read_csv('data/mammal_meta.csv')

Step 2: Acquaint

Where does this data come from? It’s kind of a convenience sample of scientific data:

  • Multiple studies \(\rightarrow\) possibly different sampling and measurement protocols

  • Criteria for inclusion unknown \(\rightarrow\) probably neither comprehensive nor representative of all such measurements taken

So these data, while richer, are still relatively narrow in terms of generalizability.

Step 3: Tidy

These datasets are still quite neat, but have a few minor things out of order.

# variable names and positions don't quite match up
# select and rename variables by position
rept_vars <- c(1, 2, 3, 4, 8, 10, 12)
bird_vars <- c(1, 2, 3, 4, 7, 12, 11)
mammal_vars <- c(1, 2, 3, 4, 7, 15, 12)
# Rename and combine datasets
rept <- reptiles %>%
  select(all_of(rept_vars)) %>%
  rename(body = `Body weight (g)`,
         brain = `Brain weight (g)`) %>%
  mutate(class = "Reptile")

bird <- birds %>%
  select(all_of(bird_vars)) %>%
  rename(body = `Body mass (g)`,
         brain = `Brain mass (g)`) %>%
  mutate(class = "Bird")

mamm <- mammals %>%
  select(all_of(mammal_vars)) %>%
  rename(body = `Body mass (g)`,
         brain = `Brain mass (g)`) %>%
  mutate(class = "Mammal")

# Combine datasets
data <- bind_rows(rept, mamm, bird)

Step 3: Tidy

In order to combine the datasets:

  • Select columns of interest;

  • Put in consistent order;

  • Give consistent names;

  • Concatenate.

I’ve suppressed the detail, but we can now inspect the result.

# missing values?
colMeans(is.na(data))
    Order    Family     Genus   Species       Sex      body     brain     class 
0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.5740405 0.0000000 

Step 3: Tidy

  • This dataset has quite a lot of missing brain weight measurements (about 57% of rows)

  • Many of the studies combined to form these datasets did not include that particular measurement.

# Aggregate by species
avg_weights <- data %>%
  drop_na() %>%
  group_by(class, Order, Species, Sex) %>%
  summarise(across(c(body, brain), mean)) %>%
  ungroup() %>%
  mutate(log_body = log(body),
         log_brain = log(brain))

Step 4: Explore

Looking at a similar plot and overlaying trend lines, we see the same power law relationship but with different proportionality constants for the three classes of animal.

# Create final plot with regression lines
ggplot(avg_weights, aes(x = log_body, y = log_brain, color = class)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal()

Step 4: Explore

Step 5: Analyze

So in this case there are three different linear relationships on the log scale that depend on animal class:

\[(\text{brain}) = \beta_1(\text{body})^{\alpha_1} \qquad \text{(mammal)} \\ (\text{brain}) = \beta_2(\text{body})^{\alpha_2} \qquad \text{(reptile)} \\ (\text{brain}) = \beta_3(\text{body})^{\alpha_3} \qquad \text{(bird)} \\ \beta_i \neq \beta_j, \alpha_i \neq \alpha_j \quad \text{for } i \neq j\]
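The class-specific lines drawn by geom_smooth can equivalently be estimated with a single interaction model. Below is a minimal sketch: the simulated data stands in for the avg_weights tibble built earlier (the column and class names mirror it, but the numbers are invented), and lm() with a class-by-slope interaction fits a separate intercept and slope per class.

```r
# simulated stand-in for avg_weights (invented numbers; column/class names match)
set.seed(1)
n <- 50
sim <- data.frame(
  class = rep(c("Mammal", "Reptile", "Bird"), each = n),
  log_body = rnorm(3 * n, mean = 5, sd = 2)
)
# each class gets its own intercept log(beta_i) and slope alpha_i
alphas <- c(Bird = 0.55, Mammal = 0.75, Reptile = 0.50)
betas  <- c(Bird = -1.0, Mammal = 0.5,  Reptile = -2.0)
sim$log_brain <- betas[sim$class] + alphas[sim$class] * sim$log_body +
  rnorm(3 * n, sd = 0.3)

# interaction model: one intercept and one slope per class
fit <- lm(log_brain ~ class * log_body, data = sim)
coef(fit)
```

The interaction terms estimate how each class's slope (the exponent \(\alpha_i\)) differs from the baseline class.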

Step 6: Interpret

It seems that the brain and body weights of the birds, mammals, and reptiles measured in these studies exhibit distinct power law relationships.

What would you investigate next?

  • Explore further?
    • Seek data on additional animal classes
    • Seek data on correlates of body weight
    • Seek data on other variables (lifespan, habitat, predation, etc.)
  • Inference and prediction?
    • Find better generalizable data
    • Estimate the \(\alpha_i\)’s and \(\beta_i\)’s
    • Find a way to predict brain weights for unobserved species
  • Something else?

Lather, rinse, repeat

Hopefully you can see how we could go through multiple iterations of the cycle, continuing to refine the question and produce more detailed analyses each time, until we arrive at a fuller understanding of the subject under study.

A comment

Notice that I did not mention the word ‘model’ anywhere!

This was intentional – it is a common misconception that analyzing data always involves fitting models.

  • Models are not always necessary or appropriate

  • We learned a lot from plots alone

The Role of Models

  • A statistical model represents a set of assumptions about how the data were generated.

  • Models can inform statistical tests

  • Can be used to make predictions or forecasts and describe sources of variability.

  • Describe more complex aspects of the data that cannot be understood with simple visualizations and exploratory analysis alone

Scope for PSTAT 100

This term we’ll work on developing your data science toolkit with foundational skills:

  • Programming in R data science libraries

  • Critical thinking about sampling and generalizing from data

  • Visualization and exploratory analysis

  • Ethical data science

  • Statistical modeling

Throughout, we’ll explore applications of these tools to case studies.

DIKW Pyramid

Data is not information

To generate information from data we need:

  • Tools to generate, collect, or scrape data

  • Ability to clean and manipulate data to more usable forms

  • In this class we will use the tidyverse

Information is not knowledge

To generate knowledge from information we need:

  • Tools for exploratory data analysis

    • Clustering: identify attributes for grouping distinct subsets of data

    • Summarizing: compact representation of data (e.g. mean, variance, skew, etc)

    • Visualization (ggplot)
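As a small illustration of summarizing, here is a sketch using the built-in iris data (any dataset would do): a compact, one-row-per-group representation of the raw measurements.

```r
library(tidyverse)

# compact representation of data: one row per group, a few descriptive statistics
pet_summary <- iris %>%
  group_by(Species) %>%
  summarize(mean_petal = mean(Petal.Length),
            var_petal  = var(Petal.Length),
            n          = n())
pet_summary
```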

Information is not knowledge

To generate knowledge we need:

  • Domain expertise and assumptions

  • Statistical and machine learning models

  • Ability to generalize about populations from samples

  • Ability to quantify our uncertainty

Knowledge is not Wisdom

Wisdom comes from knowledge when we practice:

  • Ethical decision making

    • What are the expected outcomes of each decision?
    • Might there be unintended consequences?
    • Who do my decisions help and/or hurt?
  • Respecting Privacy

    • What steps do I need to take to keep user data private and secure?

Wisdom and Data Science

Some questions we might ask ourselves throughout the quarter:

  • Is my data representative of the population?

  • Does my data have any biases? Measurement error?

  • Is the data “fair”? How does my analysis affect different groups?

Berkeley Gender Discrimination Example

       Applicants (All)  Admitted (All)  Applicants (Men)  Admitted (Men)  Applicants (Women)  Admitted (Women)
Total  12,763            41%             8,442             44%             4,321               35%

Berkeley Gender Discrimination Example

Department  Applicants (All)  Admitted (All)  Applicants (Men)  Admitted (Men)  Applicants (Women)  Admitted (Women)
A           933               64%             825               62%             108                 82%
B           585               63%             560               63%             25                  68%
C           918               35%             325               37%             593                 34%
D           792               34%             417               33%             375                 35%
E           584               25%             191               28%             393                 24%
F           714               6%              373               6%              341                 7%
Total       4,526             39%             2,691             45%             1,835               30%
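The aggregate rates can be recovered from the department table by weighting each department's admission rate by its applicant count. A minimal sketch in R, using the percentages from the table, shows why the reversal happens:

```r
# department-level numbers (departments A-F) copied from the table above
apps_men   <- c(825, 560, 325, 417, 191, 373)
rate_men   <- c(62, 63, 37, 33, 28, 6) / 100
apps_women <- c(108, 25, 593, 375, 393, 341)
rate_women <- c(82, 68, 34, 35, 24, 7) / 100

# aggregate rate = applicant-weighted average of department rates
agg_men   <- sum(apps_men * rate_men) / sum(apps_men)
agg_women <- sum(apps_women * rate_women) / sum(apps_women)

# women have the higher admission rate in 4 of 6 departments...
sum(rate_women > rate_men)    # 4
# ...but the lower rate overall (~30% vs ~45%), because most women
# applied to departments C-F, where admission rates are low for everyone
round(100 * c(men = agg_men, women = agg_women))
```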

What story does the data tell?

Cancer Deaths in the US



Cancer deaths in the US

Do you think cancer deaths (per 100,000 people) have risen or fallen since 1980?

What story does the data tell?

cancer_data <- read_csv("data/IHME/IHME_cancer.csv")

cancer_data |> filter(cause_name == "Total cancers",
                      age_name == "All ages",
                      metric_name == "Rate",
                      location_name == "United States of America") |> 
  ggplot() + geom_line(aes(x=year, y=val)) + theme_bw(base_size=16) + 
  ylab("Cancer Deaths (per 100000 people)") + 
  ggtitle("United States")

What story does the data tell?

cancer_data <- read_csv("data/IHME/IHME_cancer.csv")

cancer_data |> filter(cause_name == "Total cancers",
                      age_name == "All ages",
                      metric_name == "Rate",
                      location_name %in% c("Japan", "Uganda")) |> 
  ggplot() + geom_line(aes(x=year, y=val, col=location_name)) + theme_bw(base_size=16) + 
  ylab("Cancer Deaths (per 100000 people)")

What story does the data tell?

cancer_data <- read_csv("data/IHME/IHME_cancer.csv")

cancer_data |> filter(cause_name == "Total cancers",
                      age_name == "Age-standardized",
                      metric_name == "Rate") |> 
  ggplot() + geom_line(aes(x=year, y=val, col=location_name)) + theme_bw(base_size=16) + 
  ylab("Cancer Deaths (age-standardized)")

R and Tidyverse

Basic tidyverse operations we will make use of include:

  • Piping. Either |> or %>%

  • select, filter, and mutate

  • group_by and summarize
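A toy example combining these verbs on the built-in mtcars data (the course datasets work the same way):

```r
library(tidyverse)

# pipe a data frame through select/filter/mutate, then group and summarize
cars_summary <- mtcars |>
  select(mpg, cyl, wt) |>        # keep three columns
  filter(wt < 5) |>              # drop the heaviest cars
  mutate(kpl = mpg * 0.425) |>   # miles per gallon -> km per liter (approx.)
  group_by(cyl) |>
  summarize(mean_kpl = mean(kpl), n = n())
cars_summary
```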